Probabilistic and possibilistic language models based on the world wide web

Authors

Stanislas Oger

Vladimir Popescu

Georges Linarès

Abstract

Language models are usually built either from a closed corpus or from documents retrieved from the World Wide Web, which are themselves treated as a closed corpus. In this paper we propose several other ways of using the Web for language modeling that are better adapted to its nature. We start by improving an approach that estimates n-gram probabilities from Web search engine statistics. We then propose a new way of handling the information extracted from the Web within a probabilistic framework. We also propose relying on Possibility Theory to use this kind of information effectively. We compare these two approaches on two automatic speech recognition tasks: (i) transcribing broadcast news data, and (ii) transcribing domain-specific data, namely commentary on films of surgical operations. We show that the two approaches are effective in different situations.
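
As a rough illustration of estimating n-gram probabilities from search engine statistics, the sketch below computes a plain relative-frequency estimate from phrase hit counts and shortens the history when it has never been observed. This is a generic textbook-style estimate, not the improved method proposed in the paper, which the abstract does not detail. The function name `web_ngram_probability`, the `hit_count` callable, and the toy counts are all hypothetical; no real search engine API is assumed.

```python
from typing import Callable, Sequence


def web_ngram_probability(
    ngram: Sequence[str],
    hit_count: Callable[[str], int],
) -> float:
    """Relative-frequency estimate of P(last word | preceding words).

    `hit_count` is a hypothetical stand-in for a search engine's
    exact-phrase hit statistic; it is not a real API.
    """
    words = list(ngram)
    if len(words) < 2:
        raise ValueError("need at least a bigram")
    # Shorten the history from the left until it has been observed.
    while len(words) >= 2:
        history_hits = hit_count(" ".join(words[:-1]))
        if history_hits > 0:
            full_hits = hit_count(" ".join(words))
            # Clamp: Web hit counts are noisy and may exceed the history count.
            return min(1.0, full_hits / history_hits)
        words = words[1:]
    # No usable history was found at any order.
    return 0.0


# Toy counts standing in for search engine statistics (made-up numbers).
toy_counts = {
    "the world wide": 1000,
    "the world wide web": 800,
    "world wide": 1200,
    "world wide web": 900,
}


def lookup(phrase: str) -> int:
    return toy_counts.get(phrase, 0)


p = web_ngram_probability(["the", "world", "wide", "web"], lookup)
p_backoff = web_ngram_probability(["a", "world", "wide", "web"], lookup)
```

With these toy counts, the first call uses the full trigram history, while the second falls back to the shorter history "world wide" because "a world wide" has no hits. A real system would also need smoothing and normalization, which this sketch leaves out.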

Similar resources

Web-based possibilistic language models for automatic speech recognition

This paper describes a new kind of language models based on the Possibility Theory. The purpose of these new models is to better use the data available on the Web for language modeling. These models aim to integrate information relative to impossible word sequences. We address the two main problems of using this kind of model: how to estimate the measures for word sequences and how to integrate...

Modèles de langage ad hoc pour la reconnaissance automatique de la parole. (Ad-hoc language models for automatic speech recognition)

The three pillars of an automatic speech recognition system are the lexicon, the language model and the acoustic model. The lexicon provides all the words that can be transcribed, associated with their pronunciation. The acoustic model provides an indication of how the phone units are pronounced, and the language model brings the knowledge of how words are linked. In modern automatic speech rec...

A Retrieval Method Based on Language Model Considering Neighboring Contents

The World Wide Web (WWW) has a massive number of Web pages, so that it is difficult for users to get useful information. In recent years, however, it is said that the probabilistic language model can help to improve retrieval accuracies of some kinds of search engines. The probabilistic language model has statistical background and can adapt previous text information retrieval model. However, w...

CLIR using a Probabilistic Translation Model based on Web Documents

In this report, we describe the approach we used in TREC-8 Cross-Language IR (CLIR) track. The approach is based on probabilistic translation models estimated from two parallel training corpora: one established manually, and the other built automatically with the documents mined from the Web. We describe the principle of model building, the mining of parallel texts, as well as some preliminary ...

Combination of probabilistic and possibilistic language models

In a previous paper we proposed Web-based language models relying on the possibility theory. These models explicitly represent the possibility of word sequences. In this paper we propose to find the best way of combining this kind of model with classical probabilistic models, in the context of automatic speech recognition. We propose several combination approaches, depending on the nature of th...

Publication date: 2009
